RLadies Rotterdam

Motivation

After my PhD, I wanted to read more!

Question




How should I choose what to read?




Am I reading diversely?

Picking from lists

Questions for our data




Are lists like this biased?

Outline


1. Get to know the basics of webpages


2. Look at some examples of webscraping


3. Get some data to answer my questions

Packages we need

library(rvest)
library(tidyverse)
library(devtools)
devtools::install_github("famguy/rgoodreads")

Disclaimer



I am not a web scraping expert



And sometimes I write naughty code …



But I can share what I know

Webpage basics

Go to a webpage

View html code in Chrome

  • Right click the part of the page you want
  • Select Inspect

Html code

  • Brings up the html code
  • Highlights the piece of html code related to your click
  • Hover over html code to see other features of the web page

Inspect button

  • Similarly, click the top left button in the side panel
  • Explore related features of the webpage and html code

Basic html types

Structure: <tag> Some stuff </tag>

Basic tag types are:

div - Division or section

table - Table

p - Paragraph elements

h1 to h6 - Headings
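
To make the tag structure concrete, here is a small sketch that parses a made-up page and pulls elements out by tag name with rvest. The HTML string and its contents are invented for illustration; nothing is downloaded.

```r
library(rvest)

# A tiny, made-up page that uses the basic tags listed above
toy_html <- "<html><body>
  <div>
    <h1>Tim Winton</h1>
    <p>An Australian novelist.</p>
    <p>Author of Cloudstreet.</p>
  </div>
</body></html>"

toy_page <- read_html(toy_html)

# Select elements by tag name, then convert them to text
heading_text <- toy_page %>% html_nodes("h1") %>% html_text()
para_text <- toy_page %>% html_nodes("p") %>% html_text()

heading_text
para_text
```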

Webscraping basics

Read a webpage

library(rvest)
author_url <- "https://en.wikipedia.org/wiki/Tim_Winton"
wiki_data <- read_html(author_url) # Scrape the data from the webpage
wiki_data
## {xml_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

How to scrape a table - html_table()

table_data <- wiki_data %>% 
  rvest::html_table() #Get all tables on the webpage

length(table_data)
## [1] 3
str(table_data[[1]])
## 'data.frame': 7 obs. of 2 variables:
## $ Tim Winton: chr "Born" "Occupation" "Nationality" "Period" ...
## $ Tim Winton: chr "4 August 1960 (1960-08-04) (age 58)[1]Karrinyup, Western
Australia" "Novelist" "Australian" "1982–present" ...

Other approaches - html_nodes()

table_data_eg1 <- wiki_data %>%
 rvest::html_nodes("table") %>% # get all the nodes of type table
 purrr::pluck(1) %>% #pull out the first one
 rvest::html_table(header = FALSE) #convert it to table type
str(table_data_eg1)
## 'data.frame': 8 obs. of 2 variables:
## $ X1: chr "Tim Winton" "Born" "Occupation" "Nationality" ...
## $ X2: chr "Tim Winton" "4 August 1960 (1960-08-04) (age 58)[1]Karrinyup,
Western Australia" "Novelist" "Australian" ...

Other approaches - html_node()

table_data_eg2 <- wiki_data %>%
  rvest::html_node("table") %>% # just get the first table match
  rvest::html_table(header = FALSE) #convert it to table type
str(table_data_eg2)
## 'data.frame': 8 obs. of 2 variables:
## $ X1: chr "Tim Winton" "Born" "Occupation" "Nationality" ...
## $ X2: chr "Tim Winton" "4 August 1960 (1960-08-04) (age 58)[1]Karrinyup,
Western Australia" "Novelist" "Australian" ...
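
The difference between html_nodes() and html_node() is easier to see on a page small enough to read in full. The sketch below uses a made-up page with two tables (contents invented for illustration) and `[[1]]` in place of purrr::pluck(1), which should be equivalent here.

```r
library(rvest)

# Made-up page with two tables
toy_page <- read_html("<html><body>
  <table>
    <tr><td>Born</td><td>1960</td></tr>
    <tr><td>Nationality</td><td>Australian</td></tr>
  </table>
  <table>
    <tr><td>Title</td><td>Year</td></tr>
    <tr><td>Cloudstreet</td><td>1991</td></tr>
  </table>
</body></html>")

all_tables <- toy_page %>% html_nodes("table")  # every match (a node set)
first_table <- toy_page %>% html_node("table")  # only the first match (one node)

length(all_tables)

# Both routes lead to the same first table
via_nodes <- all_tables[[1]] %>% html_table(header = FALSE)
via_node <- first_table %>% html_table(header = FALSE)
via_node
```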

Get the Nationality

author_nationality = table_data_eg2 %>%
  dplyr::rename(Category = X1, Response = X2) %>%
  dplyr::filter(Category == "Nationality") %>%
  dplyr::select(Response) %>%
  as.character()
author_nationality
## [1] "Australian"

Can we generalise?

Same example - Different author

Generalise the web page

author_first_name = "Jane"
author_last_name = "Austen"
author_url <- paste("https://en.wikipedia.org/wiki/", 
  author_first_name, "_", author_last_name, sep = "")
wiki_data <- read_html(author_url)

Let's get that table

table_again <- wiki_data %>%
  rvest::html_nodes(".infobox.vcard") %>% #search for a class
  rvest::html_table(header = FALSE) %>% 
  purrr::pluck(1) 
head(table_again)
##                     X1
## 1          Jane Austen
## 2 Portrait, c. 1810[a]
## 3                 Born
## 4                 Died
## 5        Resting place
## 6            Education
##                                                                  X2
## 1                                                       Jane Austen
## 2                                              Portrait, c. 1810[a]
## 3 (1775-12-16)16 December 1775Steventon Rectory, Hampshire, England
## 4  18 July 1817(1817-07-18) (aged 41)Winchester, Hampshire, England
## 5                          Winchester Cathedral, Hampshire, England
## 6                                       Reading Abbey Girls' School

Web scraping is tricky

No nationality category in Jane Austen's table

##  [1] "Jane Austen"          "Portrait, c. 1810[a]" "Born"                
##  [4] "Died"                 "Resting place"        "Education"           
##  [7] "Period"               "Relatives"            ""                    
## [10] "Signature"
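
One way to cope with a missing category is a lookup that returns NA instead of failing. This is only a sketch: `get_infobox_value` is a hypothetical helper name, and the two data frames are abbreviated stand-ins for the scraped tables.

```r
# Look up one category in a two-column infobox table.
# Returns NA instead of failing when the category is absent.
get_infobox_value <- function(infobox, category) {
  is_match <- infobox[[1]] == category
  if (!any(is_match)) return(NA_character_)
  as.character(infobox[[2]][which(is_match)[1]])
}

# Toy stand-ins for the scraped tables (values abbreviated)
winton <- data.frame(X1 = c("Born", "Nationality"),
                     X2 = c("4 August 1960", "Australian"),
                     stringsAsFactors = FALSE)
austen <- data.frame(X1 = c("Born", "Died"),
                     X2 = c("16 December 1775", "18 July 1817"),
                     stringsAsFactors = FALSE)

winton_nationality <- get_infobox_value(winton, "Nationality")
austen_nationality <- get_infobox_value(austen, "Nationality")
```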

Different way - Matching paragraphs

para_data <- wiki_data %>%
  rvest::html_nodes("p") # get all the paragraphs
head(para_data)
## {xml_nodeset (6)}
## [1] <p class="mw-empty-elt">\n\n\n\n</p>
## [2] <p><b>Jane Austen</b> (<span class="nowrap"><span class="IPA nopopup ...
## [3] <p>With the publications of <i><a href="/wiki/Sense_and_Sensibility" ...
## [4] <p>A significant transition in her posthumous reputation occurred in ...
## [5] <p>Austen has inspired a large number of critical essays and literar ...
## [6] <p>There is little biographical information about Jane Austen's life ...

Get the text - html_text()

text_data <- para_data %>%
  purrr::pluck(2) %>% # get the second paragraph
  rvest::html_text() # convert the paragraph to text
head(text_data)
## [1] "Jane Austen (/ˈɒstɪn, ˈɔːs-/; 16 December 1775 – 18 July 1817) was an
English novelist known primarily for her six major novels, which interpret,
critique and comment upon the British landed gentry at the end of the 18th
century. Austen's plots often explore the dependence of women on marriage in
the pursuit of favourable social standing and economic security. Her works
critique the novels of sensibility of the second half of the 18th century and
are part of the transition to 19th-century literary realism.[2][b] Her use of
biting irony, along with her realism, humour, and social commentary, have long
earned her acclaim among critics, scholars, and popular audiences alike.[4]"

Xpath Example

  • Right click the html code, choose Copy, then Copy XPath

Using an Xpath

para_xpath = '//*[@id="mw-content-text"]/div/p[2]'
text_data <- wiki_data %>%
  rvest::html_nodes(xpath = para_xpath) %>%
  rvest::html_text()
text_data

JSpath Example

  • Right click the html code, choose Copy, then Copy JS path

Using a CSS selector

para_css = "#mw-content-text > div > p:nth-child(5)"
text_data <- wiki_data %>%
  rvest::html_nodes(css = para_css) %>%
  rvest::html_text()
text_data
## [1] "Jane Austen (/ˈɒstɪn, ˈɔːs-/; 16 December 1775 – 18 July 1817) was an
English novelist known primarily for her six major novels, which interpret,
critique and comment upon the British landed gentry at the end of the 18th
century. Austen's plots often explore the dependence of women on marriage in
the pursuit of favourable social standing and economic security. Her works
critique the novels of sensibility of the second half of the 18th century and
are part of the transition to 19th-century literary realism.[2][b] Her use of
biting irony, along with her realism, humour, and social commentary, have long
earned her acclaim among critics, scholars, and popular audiences alike.[4]"
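
The XPath asked for `p[2]` while the CSS selector asked for `p:nth-child(5)`, yet both hit the same paragraph. The reason is that XPath `p[2]` counts only `<p>` siblings, whereas `:nth-child()` counts every sibling element. A minimal sketch on a made-up page (the `content` id and the layout are invented for illustration):

```r
library(rvest)

# Made-up container: the second <p> is the fourth child of the <div>
toy_page <- read_html("<html><body><div id='content'>
  <table><tr><td>infobox</td></tr></table>
  <p>First paragraph</p>
  <h2>Heading</h2>
  <p>Second paragraph</p>
</div></body></html>")

# XPath: second <p> among the <p> children
by_xpath <- toy_page %>%
  html_nodes(xpath = "//*[@id='content']/p[2]") %>%
  html_text()

# CSS: fourth child of any type, provided it is a <p>
by_nth_child <- toy_page %>%
  html_nodes(css = "#content > p:nth-child(4)") %>%
  html_text()

# CSS: second child of type <p>, the closest match to the XPath
by_nth_of_type <- toy_page %>%
  html_nodes(css = "#content > p:nth-of-type(2)") %>%
  html_text()
```

Position-based selectors like these tend to break when a page gains or loses an element, which is one reason they can be hard to generalise across pages.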

Text Analysis

possible_nationalities <- c("Australian", "Chinese", "Mexican", "English", "Ethiopian")

# Do any of these nationalities appear in the text?
count_values = str_count(text_data, possible_nationalities)
possible_nationalities[count_values > 0] # > 0 also keeps repeated matches
## [1] "English"
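
One subtlety when matching this way: str_count() returns counts rather than TRUE/FALSE, so a nationality mentioned twice has a count of 2. The sketch below uses a made-up sentence to compare a test for exactly one match with str_detect(), which only asks whether the pattern appears at all.

```r
library(stringr)

possible_nationalities <- c("Australian", "Chinese", "Mexican", "English", "Ethiopian")

# Made-up sentence in which "English" appears twice
toy_text <- "An English novelist, known for novels about the English gentry."

count_values <- str_count(toy_text, possible_nationalities)
count_values

# Comparing with TRUE keeps only counts of exactly 1, so the repeated match is lost
exactly_once <- possible_nationalities[count_values == TRUE]

# str_detect() keeps any nationality that appears at least once
at_least_once <- possible_nationalities[str_detect(toy_text, possible_nationalities)]
```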

Learnt so far

  • Know how to explore a web page with inspect
  • Know some basics about how to get data

Also know:

  • Can be hard to generalise
  • Formats aren't always standard

Back to the original question …

How diversely do we read?

1001 books to read

Read the book list from a website

book_list_url <- "https://mizparker.wordpress.com/the-lists/1001-books-to-read-before-you-die/"
paragraph_data <- read_html(book_list_url) %>% # read the web page
  rvest::html_nodes("p") # get the paragraphs
head(paragraph_data)
## {xml_nodeset (6)}
## [1] <p>This list has appeared in several places around the internet, and ...
## [2] <p>If you would like to download a spreadsheet of the list and keep  ...
## [3] <p><strong>21st Century:</strong></p>
## [4] <p>1.  Never Let Me Go – Kazuo Ishiguro<br>\n2.  Saturday – Ian McEw ...
## [5] <p><strong>20th Century:</strong></p>
## [6] <p>70. Timbuktu – Paul Auster<br>\n71. The Romantics – Pankaj Mishra ...

Get the book list from the paragraphs

book_string <- paragraph_data %>%  #the list is in pieces
  purrr::pluck(4) %>% # get the first part of the list
  html_text(trim = TRUE) %>% # convert it to text, remove white space
  gsub("\n", "", .) #remove the newline character
head(book_string)
## [1] "1.  Never Let Me Go – Kazuo Ishiguro2.  Saturday – Ian McEwan3.  On
Beauty – Zadie Smith4.  Slow Man – J.M. Coetzee5.  Adjunct: An Undigest – Peter
Manson6.  The Sea – John Banville7.  The Red Queen – Margaret Drabble8.  The
Plot Against America – Philip Roth9.  The Master – Colm Toibin10.  Vanishing
Point – David Markson11.  The Lambs of London – Peter Ackroyd12.  Dining on
Stones – Iain Sinclair13.  Cloud Atlas – David Mitchell14.  Drop City – T.
Coraghessan Boyle15.  The Colour – Rose Tremain16.  Thursbitch – Alan Garner17.
The Light of Day – Graham Swift18.  What I Loved – Siri Hustvedt19.  The
Curious Incident of the Dog in the Night-Time – Mark Haddon20.  Islands – Dan
Sleigh21.  Elizabeth Costello – J.M. Coetzee22.  London Orbital – Iain
Sinclair23.  Family Matters – Rohinton Mistry24.  Fingersmith – Sarah Waters25.
The Double – Jose Saramago26.  Everything is Illuminated – Jonathan Safran
Foer27.  Unless – Carol Shields28.  Kafka on the Shore – Haruki Murakami29.
The Story of Lucy Gault – William Trevor30.  That They May Face the Rising Sun
– John McGahern31.  In the Forest – Edna O’Brien32.  Shroud – John Banville33.
Middlesex – Jeffrey Eugenides34.  Youth – J.M. Coetzee35.  Dead Air – Iain
Banks36.  Nowhere Man – Aleksandar Hemon37.  The Book of Illusions – Paul
Auster38.  Gabriel’s Gift – Hanif Kureishi39.  Austerlitz – W.G. Sebald40.
Platform – Michael Houellebecq41.  Schooling – Heather McGowan42.  Atonement –
Ian McEwan43.  The Corrections – Jonathan Franzen44.  Don’t Move – Margaret
Mazzantini45.  The Body Artist – Don DeLillo46.  Fury – Salman Rushdie47.  At
Swim, Two Boys – Jamie O’Neill48.  Choke – Chuck Palahniuk49.  Life of Pi –
Yann Martel50.  The Feast of the Goat – Mario Vargos Llosa51.  An Obedient
Father – Akhil Sharma52. The Devil and Miss Prym – Paulo Coelho53.  Spring
Flowers, Spring Frost – Ismail Kadare54.  White Teeth – Zadie Smith55.  The
Heart of Redness – Zakes Mda56.  Under the Skin – Michel Faber57.  Ignorance –
Milan Kundera58.  Nineteen Seventy Seven – David Peace59.  Celestial Harmonies
– Peter Esterhazy60.  City of God – E.L. Doctorow61.  How the Dead Live – Will
Self62.  The Human Stain – Philip Roth63.  The Blind Assassin – Margaret
Atwood64.  After the Quake – Haruki Murakami65.  Small Remedies – Shashi
Deshpande66.  Super-Cannes – J.G. Ballard67.  House of Leaves – Mark Z.
Danielewski68.  Blonde – Joyce Carol Oates69.  Pastoralia – George Saunders"

Let's put our list together

But … web scraping often means string handling

We want to split the string by any numbers followed by a full stop

Careful:

  • don't want to split book titles with numbers, like Catch-22,
  • don't want to split authors with full stops, like J.R.R. Tolkien

Actually a bit tricky!

Resources:

Do some string handling

strsplit("a123b", split = "\\d") 
  #Split by digits \\d
strsplit("a123b", split = "\\d+") 
  #Split by one or more digits \\d+
strsplit("a.b", split = "\\.") 
  #Split by fullstop \\.
strsplit("a1.b", split = "\\d+\\.") 
  #Split by digits and fullstop \\d+\\.
strsplit("a1.b", split = "\\d+?\\.") 
  #Matches as few digits as possible \\d+? and fullstop \\.

Check out: https://regexr.com/
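
To check the two worries (numbers in titles, full stops in author names), here is a sketch on a made-up list in the same style as the scraped string. The entries are invented for illustration and are not taken from the web page.

```r
# Made-up list: numbers inside titles, full stops inside an author name
toy_list <- "1.  Catch-22 – Joseph Heller2.  The Hobbit – J.R.R. Tolkien10.  1984 – George Orwell"

# Split only where digits are immediately followed by a full stop
pieces <- strsplit(toy_list, split = "\\d+\\.")[[1]]
pieces <- trimws(pieces)        # remove the padding after each index
pieces <- pieces[pieces != ""]  # drop the empty piece before the first index
pieces
```

"22" and "1984" are not followed by a full stop, and "J.R.R." contains no digits, so none of them trigger a split. A title that ends in a number directly followed by a full stop would still be split, so the result is worth checking by eye.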

Split up our list

split_book_string <- book_string %>% 
  strsplit(split = "\\d+?\\.") %>% 
    # split the string by any numbers followed by a full stop
  as.data.frame(stringsAsFactors = FALSE) %>% 
    # make this a data frame
  dplyr::filter(. != "")  
    # remove any empty rows
head(split_book_string)
##   c........Never.Let.Me.Go...Kazuo.Ishiguro......Saturday...Ian.McEwan...
## 1                                        Never Let Me Go – Kazuo Ishiguro
## 2                                                   Saturday – Ian McEwan
## 3                                                 On Beauty – Zadie Smith
## 4                                                 Slow Man – J.M. Coetzee
## 5                                     Adjunct: An Undigest – Peter Manson
## 6                                                 The Sea – John Banville

Split up our columns

names(split_book_string) <- "book_string"
book_df <- split_book_string %>%
  tidyr::separate(book_string, sep = "\\–", into = c("book", "author"))
  # split our author and book into columns
  # very lucky that whoever coded this webpage used a long dash (–), not a hyphen!
head(book_df)
##                      book          author
## 1        Never Let Me Go   Kazuo Ishiguro
## 2               Saturday       Ian McEwan
## 3              On Beauty      Zadie Smith
## 4               Slow Man     J.M. Coetzee
## 5   Adjunct: An Undigest     Peter Manson
## 6                The Sea    John Banville
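
The split works because each entry contains exactly one long dash. As a hedge against entries that do not, tidyr::separate() has `extra` and `fill` arguments. A sketch on made-up rows (the second and third are invented edge cases, not entries from the list):

```r
library(tidyr)

toy_books <- data.frame(
  book_string = c("Never Let Me Go – Kazuo Ishiguro",
                  "A Title – With a Dash – Some Author",
                  "No Author Listed"),
  stringsAsFactors = FALSE
)

toy_split <- separate(toy_books, book_string, into = c("book", "author"),
                      sep = "–",
                      extra = "merge",  # split once; keep the rest in `author`
                      fill = "right")   # no dash: `author` becomes NA
toy_split$book <- trimws(toy_split$book)
toy_split$author <- trimws(toy_split$author)
toy_split
```

Rows with an NA author, or with a dash left in the author column, are then easy to filter out and inspect by hand.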

Wrap the code chunks

We could vectorise it properly, but we leave that for later.
For now we'll just wrap our code snippets in a function and use lapply

Get_book_data <- function(para_ind){
  
  book_str <- paragraph_data %>%
    purrr::pluck(para_ind) %>%
    html_text(trim = TRUE) %>% 
    gsub("\n", "", .)  #remove newline character
  
  book_df <- book_str %>% 
    strsplit(split = "\\d+?\\.") %>% #match the number index
    as.data.frame(stringsAsFactors = FALSE) %>% 
    dplyr::filter(. != "")  # remove empty first row
  
  names(book_df) <- "book_string"
  book_df <- book_df %>%
    tidyr::separate(book_string, sep = "\\–", into = c("book", "author"))
  
  return(book_df)
  
}

Put it together

book_data <- lapply(seq(4,12,2) %>% as.list(), Get_book_data) %>% 
  do.call(rbind, .) %>% 
  dplyr::mutate(author = str_trim(author))
nrow(book_data) # Has 1001 rows so let's assume we are all good!
## [1] 1001
head(book_data) # Looks pretty good at first glance
##                      book         author
## 1        Never Let Me Go  Kazuo Ishiguro
## 2               Saturday      Ian McEwan
## 3              On Beauty     Zadie Smith
## 4               Slow Man    J.M. Coetzee
## 5   Adjunct: An Undigest    Peter Manson
## 6                The Sea   John Banville

Now let's get the nationalities of all the authors!

Get the nationality

Also wrap up our code pieces to get the nationality

search_string = "Tim Winton"
wiki_data <- Read_wiki_page(search_string)
infocard <- Get_wiki_infocard(wiki_data)
if(is.null(infocard)){
  nationality <- "Missing infocard"
}else if(any(infocard[,1] == "Nationality")){
  nationality <- Get_nationality_from_infocard(infocard)
}else{
  first_paragraph <- Get_first_text(wiki_data)
  nationality <- Guess_nationality_from_text(first_paragraph,
    possible_nationalities)
}
nationality
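
The definitions of these helper functions are not shown on the slide. The sketch below is a guess at what they might look like, pieced together from the earlier snippets; every function body here is an assumption, and the check runs on a made-up page rather than on Wikipedia.

```r
library(rvest)

# Assumes the article title is the author's name with underscores
Read_wiki_page <- function(search_string) {
  read_html(paste0("https://en.wikipedia.org/wiki/", gsub(" ", "_", search_string)))
}

# Return the infobox as a two-column table, or NULL if the page has none
Get_wiki_infocard <- function(wiki_data) {
  cards <- html_nodes(wiki_data, ".infobox.vcard")
  if (length(cards) == 0) return(NULL)
  html_table(cards[[1]], header = FALSE)
}

# Value next to the first "Nationality" label
Get_nationality_from_infocard <- function(infocard) {
  as.character(infocard[[2]][which(infocard[[1]] == "Nationality")[1]])
}

# First paragraph that is not empty
Get_first_text <- function(wiki_data) {
  paras <- trimws(html_text(html_nodes(wiki_data, "p")))
  paras <- paras[nzchar(paras)]
  if (length(paras) == 0) NA_character_ else paras[1]
}

# First nationality from the list that appears in the text, else NA
Guess_nationality_from_text <- function(text, possible_nationalities) {
  found <- vapply(possible_nationalities,
                  function(n) grepl(n, text, fixed = TRUE), logical(1))
  if (any(found)) possible_nationalities[found][1] else NA_character_
}

# Offline check: made-up page with an infobox but no Nationality row
toy_page <- read_html("<html><body>
  <table class='infobox vcard'>
    <tr><td>Born</td><td>1775</td></tr>
    <tr><td>Died</td><td>1817</td></tr>
  </table>
  <p> </p>
  <p>Jane Austen was an English novelist.</p>
</body></html>")

toy_nationalities <- c("Australian", "English")
toy_infocard <- Get_wiki_infocard(toy_page)
toy_result <- if (is.null(toy_infocard)) {
  "Missing infocard"
} else if (any(toy_infocard[[1]] == "Nationality")) {
  Get_nationality_from_infocard(toy_infocard)
} else {
  Guess_nationality_from_text(Get_first_text(toy_page), toy_nationalities)
}
toy_result
```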

What nationalities to search for?

We need a list of nationalities for searching.
Let's get one!

# Get table of nationalities
url <- "http://www.vocabulary.cl/Basic/Nationalities.htm"
xpath <- "/html/body/div[1]/article/table[2]"
nationalities_df <- url %>%
  read_html() %>%
  html_nodes(xpath = xpath) %>%
  html_table() %>% 
  as.data.frame()

possible_nationalities = nationalities_df[,2]
head(possible_nationalities)
## [1] "Afghan"               "Albanian"             "Algerian"            
## [4] "ArgentineArgentinian" "Australian"           "Austrian"

Manual fixing

fix_entry = "ArgentineArgentinian"
i0 = which(nationalities_df == fix_entry, arr.ind = TRUE)
new_row = nationalities_df[i0[1], ]
nationalities_df[i0] = "Argentine"
new_row[,2] = "Argentinian"
nationalities_df = rbind(nationalities_df, new_row)

fix_footnote1 = "Colombia *"
i1 = which(nationalities_df == fix_footnote1, arr.ind = TRUE)
nationalities_df[i1] = strsplit(fix_footnote1, split = ' ')[[1]][1]

fix_footnote2 = "American **"
i2 = which(nationalities_df == fix_footnote2, arr.ind = TRUE)
nationalities_df[i2] = strsplit(fix_footnote2, split = ' ')[[1]][1]

possible_nationalities = nationalities_df[,2]
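
The fixes above handle one entry at a time. A pattern-based alternative is sketched below; `clean_nationality_entries` is a hypothetical helper, and it assumes that footnote markers are always trailing asterisks and that fused entries always join at a lowercase-to-uppercase boundary. It works on a character vector rather than on the data frame.

```r
clean_nationality_entries <- function(x) {
  x <- gsub("\\s*\\*+$", "", x)               # drop trailing footnote stars
  x <- gsub("([a-z])([A-Z])", "\\1|\\2", x)   # mark the join in fused entries
  unlist(strsplit(x, "|", fixed = TRUE))      # one element per nationality
}

# Toy input in the same style as the scraped column
toy_entries <- c("Afghan", "ArgentineArgentinian", "American **", "South African")
cleaned_entries <- clean_nationality_entries(toy_entries)
cleaned_entries
```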

Get Nationalities

nationality_from_author_search = sapply(book_data$author[1:20],  
                                        function(search_string){
  nationality = tryCatch( # Just in case!
     Query_nationality_from_wiki(search_string, 
                                 possible_nationalities),
    error = function(e) NA)
  }) %>% unlist()
nationality_from_author_search
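
The tryCatch() wrapper is what lets the loop carry on when one lookup fails. A minimal offline sketch of the same pattern, where `toy_lookup` is a hypothetical stand-in for the Wikipedia query:

```r
# Stand-in for a lookup that sometimes fails (no network access needed)
toy_lookup <- function(author) {
  known <- c("Tim Winton" = "Australian", "Zadie Smith" = "English")
  if (!author %in% names(known)) stop("No page found for ", author)
  unname(known[author])
}

authors <- c("Tim Winton", "Unknown Author", "Zadie Smith")

# A failed lookup becomes NA instead of stopping the whole run
results <- vapply(authors, function(author) {
  tryCatch(toy_lookup(author), error = function(e) NA_character_)
}, character(1))
results
```

vapply() is used here instead of sapply() so that the result is always a character vector, even if every lookup fails.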

Run it!

nationality_from_author_search = sapply(book_data$author %>% unique(),  
                                        function(search_string){
  print(search_string)
  nationality = tryCatch( # Just in case!
     Query_nationality_from_wiki(search_string, 
                                 possible_nationalities),
    error = function(e) NA)
  }) 
author_nationality_df <- as.data.frame(nationality_from_author_search) %>%
  dplyr::mutate(author = rownames(.))
names(author_nationality_df) <- c("nationality", "author")
book_data <- book_data %>%
  dplyr::left_join(author_nationality_df, by = "author")
head(book_data)

save(book_data, file = "book_data.RData")

Result

load("book_data.RData")
table_nationalities <- book_data %>% 
  dplyr::select(author, nationality) %>%
  dplyr::distinct() %>%
  dplyr::select(nationality) %>%
  unlist() %>% 
  table() %>% 
  as.data.frame(stringsAsFactors = FALSE)
names(table_nationalities ) = c("Nationality", "Frequency")
table_nationalities %>% 
  arrange(desc(Frequency))
##                                           Nationality Frequency
## 1                                            American       114
## 2                                             English        68
## 3                                             British        57
## 4                                    Missing infocard        47
## 5                                              French        42
## 6                                               Irish        19
## 7                                             Italian        17
## 8                                              German        15
## 9                                            Scottish        12
## 10                                            Russian         9
## 11                                              Dutch         6
## 12                                           Austrian         5
## 13                                            Spanish         5
## 14                                           Canadian         4
## 15                                          Hungarian         4
## 16                                           Japanese         4
## 17                                      South African         4
## 18                                          Brazilian         3
## 19                                             Indian         3
## 20                                             Polish         3
## 21                                        Anglo-Irish         2
## 22                                         Australian         2
## 23                                              Czech         2
## 24                                              Greek         2
## 25                                             Kenyan         2
## 26                                            Mexican         2
## 27                                          Norwegian         2
## 28                                         Portuguese         2
## 29                                            Swedish         2
## 30                                              Swiss         2
## 31                                           Albanian         1
## 32 American (1888–1907, 1956–1959)British (1907–1956)         1
## 33                                 American, Canadian         1
## 34                                          Argentine         1
## 35                                            Belgian         1
## 36                                  British (English)         1
## 37                                 British Australian         1
## 38                                 British citizen[1]         1
## 39               British, emigrated to Australia 1950         1
## 40                                 Bulgarian, British         1
## 41                                           CanadaUS         1
## 42                                  Canadian/American         1
## 43                                            Chilean         1
## 44                                          Colombian         1
## 45                                              Cuban         1
## 46                                     Cuban American         1
## 47                                             Danish         1
## 48                                          Dominican         1
## 49                      Dutch (citizenship), European         1
## 50                                            Finnish         1
## 51                                   French, American         1
## 52                                     German, French         1
## 53                                      German, Swiss         1
## 54                GermanHungarian (by marriage, 1925)         1
## 55                                          Icelandic         1
## 56                                    Igbo of Nigeria         1
## 57                                              India         1
## 58                                            Ireland         1
## 59                                     Irish, British         1
## 60                                Italian, Portuguese         1
## 61                                        New Zealand         1
## 62                      New Zealand (British subject)         1
## 63                    Russian (early years), American         1
## 64                                     Russian empire         1
## 65                        Russian, American and Swiss         1
## 66                           Russian, later Soviet[1]         1
## 67                             South Africa, Botswana         1
## 68               South AfricanAustralian (since 2006)         1
## 69                                       South Korean         1
## 70                                  Swiss and British         1
## 71                                      United States         1
## 72                                              Welsh         1
## 73                                         Zimbabwean         1
head(table_nationalities)
##                                          Nationality Frequency
## 1                                           Albanian         1
## 2                                           American       114
## 3 American (1888–1907, 1956–1959)British (1907–1956)         1
## 4                                 American, Canadian         1
## 5                                        Anglo-Irish         2
## 6                                          Argentine         1

Let's take a look

library(plotly) # provides plot_ly(), add_pie() and layout()
pie_plot <- table_nationalities %>%
  plot_ly(labels = ~Nationality, values = ~Frequency) %>%
  add_pie(hole = 0.6) %>% # a hole > 0 draws a donut chart
  layout(title = "Nationalities", showlegend = FALSE,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
pie_plot # print the object to display the chart

Plotting result

A few thoughts

  • So many ways to approach this problem
  • This was a question I explored for fun
  • Approach here was to use the standard rvest toolbox
  • Not perfect: the nationality strings still need a lot of cleaning

What else could we have done

  • Can scrape more data from the Goodreads website
  • Goodreads has an API
  • Check out the repository by famguy/rgoodreads to get started
  • Using this API makes querying things like year or gender straightforward
  • But Goodreads has no nationality, so this solution is still useful!

What else for webscraping

  • There are easier ways to answer this same question
  • Namely, RSelenium for pages with javascript
  • Learning the hard way can be good sometimes though!